101 research outputs found

    EcID. A database for the inference of functional interactions in E. coli

    Get PDF
    The EcID database (Escherichia coli Interaction Database) provides a framework for the integration of information on functional interactions extracted from the following sources: EcoCyc (metabolic pathways, protein complexes and regulatory information), KEGG (metabolic pathways), MINT and IntAct (protein interactions). It also includes information on protein complexes from the two E. coli high-throughput pull-down experiments and potential interactions extracted from the literature using the web services associated to the iHOP text-mining system. Additionally, EcID incorporates results of various prediction methods, including two protein interaction prediction methods based on genomic information (Phylogenetic Profiles and Gene Neighbourhoods) and three methods based on the analysis of co-evolution (Mirror Tree, In Silico 2 Hybrid and Context Mirror). EcID associates to each prediction a specifically developed confidence score. The two main features that make EcID different from other systems are the combination of co-evolution-based predictions with the experimental data, and the introduction of E. coli-specific information, such as gene regulation information from EcoCyc. The possibilities offered by the combination of the EcID database information are illustrated with a prediction of potential functions for a group of poorly characterized genes related to yeaG. EcID is available online at http://ecid.bioinfo.cnio.es

    EcID. A database for the inference of functional interactions in E. coli

    Get PDF
    The EcID database (Escherichia coli Interaction Database) provides a framework for the integration of information on functional interactions extracted from the following sources: EcoCyc (metabolic pathways, protein complexes and regulatory information), KEGG (metabolic pathways), MINT and IntAct (protein interactions). It also includes information on protein complexes from the two E. coli high-throughput pull-down experiments and potential interactions extracted from the literature using the web services associated to the iHOP text-mining system. Additionally, EcID incorporates results of various prediction methods, including two protein interaction prediction methods based on genomic information (Phylogenetic Profiles and Gene Neighbourhoods) and three methods based on the analysis of co-evolution (Mirror Tree, In Silico 2 Hybrid and Context Mirror). EcID associates to each prediction a specifically developed confidence score. The two main features that make EcID different from other systems are the combination of co-evolution-based predictions with the experimental data, and the introduction of E. coli-specific information, such as gene regulation information from EcoCyc. The possibilities offered by the combination of the EcID database information are illustrated with a prediction of potential functions for a group of poorly characterized genes related to yeaG. EcID is available online at http://ecid.bioinfo.cnio.es

    The efficacy of various machine learning models for multi-class classification of RNA-seq expression data

    Full text link
    Late diagnosis and high costs are key factors that negatively impact the care of cancer patients worldwide. Although the availability of biological markers for the diagnosis of cancer type is increasing, costs and reliability of tests currently present a barrier to the adoption of their routine use. There is a pressing need for accurate methods that enable early diagnosis and cover a broad range of cancers. The use of machine learning and RNA-seq expression analysis has shown promise in the classification of cancer type. However, research is inconclusive about which type of machine learning models are optimal. The suitability of five algorithms were assessed for the classification of 17 different cancer types. Each algorithm was fine-tuned and trained on the full array of 18,015 genes per sample, for 4,221 samples (75 % of the dataset). They were then tested with 1,408 samples (25 % of the dataset) for which cancer types were withheld to determine the accuracy of prediction. The results show that ensemble algorithms achieve 100% accuracy in the classification of 14 out of 17 types of cancer. The clustering and classification models, while faster than the ensembles, performed poorly due to the high level of noise in the dataset. When the features were reduced to a list of 20 genes, the ensemble algorithms maintained an accuracy above 95% as opposed to the clustering and classification models.Comment: 12 pages, 4 figures, 3 tables, conference paper: Computing Conference 2019, published at https://link.springer.com/chapter/10.1007/978-3-030-22871-2_6

    Prediction of protein binding sites in protein structures using hidden Markov support vector machine

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Predicting the binding sites between two interacting proteins provides important clues to the function of a protein. Recent research on protein binding site prediction has been mainly based on widely known machine learning techniques, such as artificial neural networks, support vector machines, conditional random field, etc. However, the prediction performance is still too low to be used in practice. It is necessary to explore new algorithms, theories and features to further improve the performance.</p> <p>Results</p> <p>In this study, we introduce a novel machine learning model hidden Markov support vector machine for protein binding site prediction. The model treats the protein binding site prediction as a sequential labelling task based on the maximum margin criterion. Common features derived from protein sequences and structures, including protein sequence profile and residue accessible surface area, are used to train hidden Markov support vector machine. When tested on six data sets, the method based on hidden Markov support vector machine shows better performance than some state-of-the-art methods, including artificial neural networks, support vector machines and conditional random field. Furthermore, its running time is several orders of magnitude shorter than that of the compared methods.</p> <p>Conclusion</p> <p>The improved prediction performance and computational efficiency of the method based on hidden Markov support vector machine can be attributed to the following three factors. Firstly, the relation between labels of neighbouring residues is useful for protein binding site prediction. Secondly, the kernel trick is very advantageous to this field. Thirdly, the complexity of the training step for hidden Markov support vector machine is linear with the number of training samples by using the cutting-plane algorithm.</p

    Analysis of temporal transcription expression profiles reveal links between protein function and developmental stages of Drosophila melanogaster

    Get PDF
    Accurate gene or protein function prediction is a key challenge in the post-genome era. Most current methods perform well on molecular function prediction, but struggle to provide useful annotations relating to biological process functions due to the limited power of sequence-based features in that functional domain. In this work, we systematically evaluate the predictive power of temporal transcription expression profiles for protein function prediction in Drosophila melanogaster. Our results show significantly better performance on predicting protein function when transcription expression profile-based features are integrated with sequence-derived features, compared with the sequence-derived features alone. We also observe that the combination of expression-based and sequence-based features leads to further improvement of accuracy on predicting all three domains of gene function. Based on the optimal feature combinations, we then propose a novel multi-classifier-based function prediction method for Drosophila melanogaster proteins, FFPred-fly+. Interpreting our machine learning models also allows us to identify some of the underlying links between biological processes and developmental stages of Drosophila melanogaster

    Protein docking prediction using predicted protein-protein interface

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Many important cellular processes are carried out by protein complexes. To provide physical pictures of interacting proteins, many computational protein-protein prediction methods have been developed in the past. However, it is still difficult to identify the correct docking complex structure within top ranks among alternative conformations.</p> <p>Results</p> <p>We present a novel protein docking algorithm that utilizes imperfect protein-protein binding interface prediction for guiding protein docking. Since the accuracy of protein binding site prediction varies depending on cases, the challenge is to develop a method which does not deteriorate but improves docking results by using a binding site prediction which may not be 100% accurate. The algorithm, named PI-LZerD (using Predicted Interface with Local 3D Zernike descriptor-based Docking algorithm), is based on a pair wise protein docking prediction algorithm, LZerD, which we have developed earlier. PI-LZerD starts from performing docking prediction using the provided protein-protein binding interface prediction as constraints, which is followed by the second round of docking with updated docking interface information to further improve docking conformation. Benchmark results on bound and unbound cases show that PI-LZerD consistently improves the docking prediction accuracy as compared with docking without using binding site prediction or using the binding site prediction as post-filtering.</p> <p>Conclusion</p> <p>We have developed PI-LZerD, a pairwise docking algorithm, which uses imperfect protein-protein binding interface prediction to improve docking accuracy. PI-LZerD consistently showed better prediction accuracy over alternative methods in the series of benchmark experiments including docking using actual docking interface site predictions as well as unbound docking cases.</p

    Characterization of pathogenic germline mutations in human Protein Kinases

    Get PDF
    Background Protein Kinases are a superfamily of proteins involved in crucial cellular processes such as cell cycle regulation and signal transduction. Accordingly, they play an important role in cancer biology. To contribute to the study of the relation between kinases and disease we compared pathogenic mutations to neutral mutations as an extension to our previous analysis of cancer somatic mutations. First, we analyzed native and mutant proteins in terms of amino acid composition. Secondly, mutations were characterized according to their potential structural effects and finally, we assessed the location of the different classes of polymorphisms with respect to kinase-relevant positions in terms of subfamily specificity, conservation, accessibility and functional sites.&lt;p&gt;&lt;/p&gt; Results Pathogenic Protein Kinase mutations perturb essential aspects of protein function, including disruption of substrate binding and/or effector recognition at family-specific positions. Interestingly these mutations in Protein Kinases display a tendency to avoid structurally relevant positions, what represents a significant difference with respect to the average distribution of pathogenic mutations in other protein families.&lt;p&gt;&lt;/p&gt; Conclusions Disease-associated mutations display sound differences with respect to neutral mutations: several amino acids are specific of each mutation type, different structural properties characterize each class and the distribution of pathogenic mutations within the consensus structure of the Protein Kinase domain is substantially different to that for non-pathogenic mutations. This preferential distribution confirms previous observations about the functional and structural distribution of the controversial cancer driver and passenger somatic mutations and their use as a proxy for the study of the involvement of somatic mutations in cancer development.&lt;p&gt;&lt;/p&gt

    Protein-Protein Interaction Site Predictions with Three-Dimensional Probability Distributions of Interacting Atoms on Protein Surfaces

    Get PDF
    Protein-protein interactions are key to many biological processes. Computational methodologies devised to predict protein-protein interaction (PPI) sites on protein surfaces are important tools in providing insights into the biological functions of proteins and in developing therapeutics targeting the protein-protein interaction sites. One of the general features of PPI sites is that the core regions from the two interacting protein surfaces are complementary to each other, similar to the interior of proteins in packing density and in the physicochemical nature of the amino acid composition. In this work, we simulated the physicochemical complementarities by constructing three-dimensional probability density maps of non-covalent interacting atoms on the protein surfaces. The interacting probabilities were derived from the interior of known structures. Machine learning algorithms were applied to learn the characteristic patterns of the probability density maps specific to the PPI sites. The trained predictors for PPI sites were cross-validated with the training cases (consisting of 432 proteins) and were tested on an independent dataset (consisting of 142 proteins). The residue-based Matthews correlation coefficient for the independent test set was 0.423; the accuracy, precision, sensitivity, specificity were 0.753, 0.519, 0.677, and 0.779 respectively. The benchmark results indicate that the optimized machine learning models are among the best predictors in identifying PPI sites on protein surfaces. In particular, the PPI site prediction accuracy increases with increasing size of the PPI site and with increasing hydrophobicity in amino acid composition of the PPI interface; the core interface regions are more likely to be recognized with high prediction confidence. The results indicate that the physicochemical complementarity patterns on protein surfaces are important determinants in PPIs, and a substantial portion of the PPI sites can be predicted correctly with the physicochemical complementarity features based on the non-covalent interaction data derived from protein interiors

    Roles of residues in the interface of transient protein-protein complexes before complexation

    Get PDF
    Transient protein-protein interactions play crucial roles in all facets of cellular physiology. Here, using an analysis on known 3-D structures of transient protein-protein complexes, their corresponding uncomplexed forms and energy calculations we seek to understand the roles of protein-protein interfacial residues in the unbound forms. We show that there are conformationally near invariant and evolutionarily conserved interfacial residues which are rigid and they account for ∼65% of the core interface. Interestingly, some of these residues contribute significantly to the stabilization of the interface structure in the uncomplexed form. Such residues have strong energetic basis to perform dual roles of stabilizing the structure of the uncomplexed form as well as the complex once formed while they maintain their rigid nature throughout. This feature is evolutionarily well conserved at both the structural and sequence levels. We believe this analysis has general bearing in the prediction of interfaces and understanding molecular recognition

    Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Protein-protein interactions play essential roles in protein function determination and drug design. Numerous methods have been proposed to recognize their interaction sites, however, only a small proportion of protein complexes have been successfully resolved due to the high cost. Therefore, it is important to improve the performance for predicting protein interaction sites based on primary sequence alone.</p> <p>Results</p> <p>We propose a new idea to construct an integrative profile for each residue in a protein by combining its hydrophobic and evolutionary information. A support vector machine (SVM) ensemble is then developed, where SVMs train on different pairs of positive (interface sites) and negative (non-interface sites) subsets. The subsets having roughly the same sizes are grouped in the order of accessible surface area change before and after complexation. A self-organizing map (SOM) technique is applied to group similar input vectors to make more accurate the identification of interface residues. An ensemble of ten-SVMs achieves an MCC improvement by around 8% and F1 improvement by around 9% over that of three-SVMs. As expected, SVM ensembles constantly perform better than individual SVMs. In addition, the model by the integrative profiles outperforms that based on the sequence profile or the hydropathy scale alone. As our method uses a small number of features to encode the input vectors, our model is simpler, faster and more accurate than the existing methods.</p> <p>Conclusions</p> <p>The integrative profile by combining hydrophobic and evolutionary information contributes most to the protein-protein interaction prediction. Results show that evolutionary context of residue with respect to hydrophobicity makes better the identification of protein interface residues. In addition, the ensemble of SVM classifiers improves the prediction performance.</p> <p>Availability</p> <p>Datasets and software are available at <url>http://mail.ustc.edu.cn/~bigeagle/BMCBioinfo2010/index.htm</url>.</p
    corecore